DOCUMENT RESUME

ED 157 9*2                                TM 007 360

AUTHOR          Rudner, Lawrence M.; Convey, John J.
TITLE           An Evaluation of Select Approaches for Biased Item
                Identification.
PUB DATE        Mar 78
NOTE            38p.; Paper presented at the Annual Meeting of the
                American Educational Research Association (62nd,
                Toronto, Ontario, Canada, March 27-31, 1978)

EDRS PRICE      MF-$0.83 HC-$2.06 Plus Postage.
DESCRIPTORS     Aurally Handicapped; *Comparative Statistics;
                Complexity Level; *Culture Free Tests; Evaluation
                Criteria; Factor Analysis; Identification; *Item
                Analysis; *Mathematical Models; Primary Education;
                *Test Bias; *Test Items
IDENTIFIERS     Chi Square; Item Characteristic Curve Theory



ABSTRACT

Transformed item difficulties, chi-square, item characteristic curve
(icc) theory and factor score techniques were evaluated as approaches for
the identification of biased test items. The study was implemented to
determine whether the approaches would provide identical classifications
of items as to degree of aberrance for culturally different populations
and classifications of minimal bias for subsamples of a single population.
Actual item response data were obtained from 2,637 hearing impaired and
1,607 normal students, and two pseudo-culture group samples of subjects
from the same population with different mean total scores. The Stanford
Achievement Test, a 48-item test of reading comprehension, was
administered to the students. In the diverse culture group comparison,
subjects responded to a pool of items which measured reading
comprehension, and several items were found to be biased. Degrees of
aberrance in the equal-culture group were consistently low for the icc
theory and chi-square approaches. These approaches were felt to be the
most promising, and the icc theory approach was sensitive to individual
and group bias. The factor score and chi-square approaches were
inadequate for identifying bias. Several common items were identified
for the diverse culture comparison by the transformed item difficulties,
icc theory, and chi-square approaches. Recommendations are made for
future studies incorporating known parameters and distractor response
analysis. (Author/JAC)



* Reproductions supplied by EDRS are the best that can be made *
* from the original document. *



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.



An Evaluation of Select Approaches 
For Biased Item Identification



"PERMISSION TO REPRODUCE THIS 
MATERIAL HAS BEEN GRANTED BY

TO THE EDUCATIONAL RESOURCES
INFORMATION CENTER (ERIC) AND
USERS OF THE ERIC SYSTEM."



Lawrence M. Rudner

Gallaudet College
Model Secondary School for the Deaf

and

John J. Convey



The Catholic University of America 



Printed in USA 



Paper presented at the annual meeting of the American
Educational Research Association, Toronto, Canada

March, 1978



Problem

Approximately 25 years ago, Eells and his colleagues conducted what appears
to be the first serious attempt to examine items for bias (Eells, Davis,
Havighurst, Herrick and Tyler, 1951) and developed one of the first measures
purported to be culture fair. Since that time, the entire issue of cultural
bias in measurement has become heated, complex, and pronounced in the
literature. Motions by the National Association of Black Psychologists, the
American Personnel and Guidance Association, the National Education
Association, the National Association for the Advancement of Colored People,
the National Association of Elementary School Principals and the Council of
the Society for the Psychological Study of Social Issues calling for
moratoria on certain types of tests, banning tests, and requiring alternative
plans for testing, indicate the serious nature of the current situation (see
Williams, Mosby and Hinson, 1977). The concern is also apparent in recent
litigation (DeFunis vs. Odegaard, 1974; Diana vs. the California State Board
of Education, 1970; Hobson vs. Hansen, 1967). Naturally, all this has not
gone unnoticed by those involved in the measurement field. Bias and
debiasing studies have occurred and various models have been proposed in
ever-expanding efforts to meet the challenge of bias in educational
assessment.

One major type of bias investigation is concerned with the instrument
as a whole and examines the question: Does a test unduly favor or impede
examinees from different parts of the country or of different backgrounds?
Another is concerned with the items within a test and asks: Which items and
item formats are appropriate for a given population and which may be used
across given cultures?

The first type of investigation is of interest to the test users who
need to evaluate the appropriateness of the test information. The models
proposed by Cleary (1968), Thorndike (1971), Darlington (1971), Cole (1973),
Einhorn and Bass (1971) and Gross and Su (1975) (also see the entire Spring
1976 issue of the Journal of Educational Measurement) exemplify this first
type of investigation. The second type of investigation is of interest to
developers as it assists them in developing valid and cross-culture fair
items and provides a framework for constructing better tests in subsequent
efforts. By identifying and removing biased items from an initial item pool,
test developers could, theoretically, develop a measure free of bias. The
work of Angoff (1972), Cardall and Coffman (1964), Green and Draper (1972),
Merz (1973, 1976), Rudner (1977a), Scheuneman (1975) and Veale and Foreman
(1975, 1976) (see the reviews by Merz, 1977 and Rudner, 1977b) has been
directed at this need. It is this second type of bias, item bias, which the
present paper addresses.

Typically, these researchers have adopted a single approach and used
that approach exclusively in their work. As a result, studies applying more
than one approach to a single set of data have been sparse. This situation
has led to the problem identified by Merz (1977) and addressed by this study:
the psychometric properties of the approaches have not been fully evaluated
using hypothetical and actual item response data.

Purpose

The purpose of this study was to investigate the following four approaches
to biased item identification using common sets of actual item response data:

1. Transformed item difficulties, in which within-group p-values are
   standardized and compared between groups (Angoff, 1972);

2. Chi-square, in which individual items are investigated in terms of
   between-group score level differences in expected and observed
   proportions of correct responses (Scheuneman, 1975);

3. Item characteristic curve theory, in which differences in the
   probabilities of a correct response given examinees of the same
   underlying ability and in different culture groups are evaluated
   (Rudner, 1977a);

4. Factor score, in which item bias is investigated in terms of loadings
   on biased test factors (Merz, 1973).

The investigation addresses the following questions:

1. Do the select approaches provide identical classifications of items
   as to their degree of aberrance when applied to item response data
   corresponding to two culturally different populations?

This question calls for a comparison of the approaches as they would
typically be applied in test development or test evaluation studies.

2. Do the select approaches provide classifications of minimal bias
   when applied to subsamples of a single population?

This question is similar to one asked by Jensen (1973) and serves to
evaluate the adequacy of the various approaches. Here, an approach
identifying an abundance of items as biased would be suspect as being
inadequate.

The Models

Transformed Item Difficulties

This approach, which examines the interaction of item and groups,
appears to be one of the best known. It has been advocated and used
frequently by Angoff (1972; and Ford, 1973; and Modu, 1973) and others
(Green and Draper, 1972; Jensen, 1973; Hicks, Donlon, and Wallmark, 1976;
Strassberg-Rosenberg and Donlon, 1975; Echternacht, 1975; Rudner, 1977c).

In this method, p-values for a group of items are obtained for two
different groups of examinees. Each p-value is converted to a normal deviate
and the pairs of normal deviates, one pair for each item, are plotted on a
bivariate graph, each pair represented by a point on the graph.

The plot will generally be in the form of an ellipse. A 45 degree line,
passing through the origin, provides an indication of the absence of bias.
Items greatly deviating from this line may be regarded as exhibiting an
item x group interaction. That is, relative to the other items, deviant
items are especially more difficult for members of one group than the other.
Assuming both groups received similar instructions, such items would appear
to represent different psychological meanings for the two groups of
examinees.

Since the intent is to make comparisons of between-group differences in
item difficulty, it is necessary to transform the proportion passing an item
to an index of item difficulty which constitutes at least an interval scale.
This is accomplished by expressing each item p-value in terms of within-group
deviations of a normal curve (see Guilford, 1954, pp. 418-419).

The distance of an item point to the line can be treated as a measure of
the degree of item bias. One can determine which items are "greatly
deviating" from the line by incorporating outlier or residual analysis. One
method is to place confidence limits on the line by using a multiple of the
standard error of estimation. An alternate approach, adopted by
Strassberg-Rosenberg and Donlon (1975) and Hicks, et al. (1976), involves
computing the standard deviation of the residuals and classifying as biased
those items deviating by greater than 1.5 standard deviation units. Rudner
(1977c) has employed a fixed item-regression line distance of .75 z-score
units.
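
The computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the p-values are hypothetical, the normal deviate is taken for the proportion failing each item (so that harder items receive larger values), distances are measured perpendicular to the 45 degree line rather than to a fitted regression line or main axis, and the function names are invented for this sketch. The .75 cutoff is the fixed distance mentioned in the text.

```python
from math import sqrt
from statistics import NormalDist


def normal_deviate(p):
    """Convert a proportion correct (p-value) to a normal deviate.

    The deviate is taken for the proportion failing (1 - p), so harder
    items get larger values. This sign convention is an assumption.
    """
    return NormalDist().inv_cdf(1.0 - p)


def distances_from_45_degree_line(p_group1, p_group2):
    """Perpendicular distance of each item point from the line y = x."""
    distances = []
    for p1, p2 in zip(p_group1, p_group2):
        z1, z2 = normal_deviate(p1), normal_deviate(p2)
        distances.append(abs(z2 - z1) / sqrt(2.0))
    return distances


def flag_items(distances, cutoff=0.75):
    """Return 1-based item numbers whose distance exceeds the cutoff."""
    return [i + 1 for i, d in enumerate(distances) if d > cutoff]


# Hypothetical p-values for five items in two groups (illustrative only).
p_group1 = [0.80, 0.65, 0.50, 0.40, 0.90]
p_group2 = [0.78, 0.60, 0.52, 0.38, 0.30]

distances = distances_from_45_degree_line(p_group1, p_group2)
flagged = flag_items(distances)
print(flagged)
```

In this made-up example only the last item, which is easy for the first group and hard for the second, lies far enough from the line to be flagged.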



Insert Figure 1 about here 



An example of the approach is shown in Figure 1. The transformed p-values
have a correlation of approximately .90, making the plot relatively long and
flat. The solid line represents the main axis and the dotted lines represent
linear confidence limits. The item represented in the upper left, outside the
confidence interval, would be considered biased.

Chi-Square

This approach to biased item analysis determines whether examinees of
the same ability level have the same probability of a correct response
regardless of cultural affiliation. This is accomplished by dividing the
tryout samples into groups based on their observed score and comparing the
proportions of students within each level responding correctly with a
chi-square test for independent observations (Scheuneman, 1975, 1976; Green
and Draper, 1972). An item is considered unbiased if, for all individuals in
the same total score interval, the proportion of correct response is the same
for both groups under consideration. A modified chi-square test determines
the probability that an item is unbiased by this definition.

Scheuneman (1976), in applying the approach to several sets of data,
advocates using four or five total score levels based on the score
distribution of the smaller sample (Green and Draper had used within-group
quintiles).

Item Characteristic Curve Theory

Latent trait or item characteristic curve (icc) theory relates the
probability of a correct item response to a function of an examinee's
underlying ability level ($\theta_i$) and characteristic(s) of the item.
While the various models (Lord, 1952; Rasch, 1960; Birnbaum, 1968; Urry,
1970) differ in terms of the number of item parameters considered, they all
describe the item parameter(s) independently of the examined sample. Full
development of these and other mental measurement models can be found in
Hambleton and Cook (1977).



This modern measurement theory has been used to identify biased items
(Green and Draper, 1972; Pine, 1976; Lord, 1977; Rudner, 1977a). In an early
study, Green and Draper (1972) had used observed total scores as estimates
of examinees' abilities, $\theta_i$'s, and the proportions of examinees
responding correctly at each total score level as estimates of
$P(u_g = 1 \mid \theta_i)$. Their procedure called for plotting estimated
icc's for each item separately for each culture group and comparing the
plots.



Insert Figure 2 about here



By this and other latent trait theory approaches, an item is unbiased if
examinees of the same ability level, but of different cultural affiliations,
have equal probabilities of responding correctly. That is, an item is
unbiased if the estimated icc's obtained from the various culture groups are
identical. As an example of a biased item, consider the two hypothetical
curves shown in Figure 2. These curves are based on responses by different
culture groups to the same item. Total observed scores are used as estimates
of $\theta_i$ and proportions of examinees responding correctly are used as
estimates of $P(u_g = 1 \mid \theta_i)$. The curves are not identical, since
the location parameters for the two curves are not equal. Such an item can be
considered biased in that examinees of the same ability level, e.g.
$\theta_i$ = 58%, but from different culture groups, do not have similar
proportions of correct responses. While this approach is appealing, total
observed scores are directly incorporated and quantification of the degree of
item bias is difficult (an eyeballing procedure is used to identify a "very
biased item").

Rather than using total observed scores as estimates of $\theta_i$ and
proportions as estimates for $P(u_g = 1 \mid \theta_i)$, more accurate values
can be obtained using one of the recent methods of parameterization (Urry,
1975; Wingersky and Lord, 1973). During parameterization, the metric used for
the $\theta$ scale is defined by the ability variance in the examined sample.
In order to compare parameters obtained from two different examinee groups,
the obtained values must be equated. Lord and Novick (1968, chapter 16.11)
have shown that this can be accomplished by computing the regressions of the
parameter values based on one group of examinees on the parameter values
based on the other group of examinees.

Rudner (1977a) has refined the procedure used by Green and Draper to
identify biased items by incorporating equated icc parameter values. The area
between pairs of equated icc's is used to indicate the relative amount of
aberrance for each item and eyeballing of the equated icc's is employed to
provide additional information as to the nature of the aberrance.
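
A rough sketch of the area index follows. It assumes the three-parameter logistic form with the customary scaling constant of 1.7, a simple linear rescaling of the ability metric in place of the regression-based equating described above, and a summation grid running from -5 to 5 in steps of .005 (the step size and lower limit follow the Procedures section; the upper limit is an assumption). All function names and parameter values are hypothetical.

```python
from math import exp


def icc_3pl(theta, a, b, c, scale=1.7):
    """Three-parameter logistic icc: probability of a correct response.

    a = discrimination, b = difficulty, c = lower asymptote (guessing).
    scale = 1.7 is the usual logistic scaling constant (assumed here).
    """
    return c + (1.0 - c) / (1.0 + exp(-scale * a * (theta - b)))


def equate(a, b, slope, intercept):
    """Place one group's parameters on the other group's ability scale.

    Assumes the scales are related by theta* = slope * theta + intercept,
    so that b* = slope * b + intercept and a* = a / slope.
    """
    return a / slope, slope * b + intercept


def area_between_iccs(params1, params2, lo=-5.0, hi=5.0, step=0.005):
    """Approximate the area between two iccs by a sum over a theta grid."""
    n_steps = int(round((hi - lo) / step))
    total = 0.0
    for k in range(n_steps + 1):
        theta = lo + k * step
        gap = abs(icc_3pl(theta, *params1) - icc_3pl(theta, *params2))
        total += gap * step
    return total


# Hypothetical parameter estimates (a, b, c) for one item in two groups.
item_group1 = (1.0, 0.0, 0.2)
a2, b2 = equate(a=2.0, b=0.0, slope=2.0, intercept=0.5)
item_group2 = (a2, b2, 0.2)

area = area_between_iccs(item_group1, item_group2)
print(round(area, 2))
```

When two curves share the same discrimination and lower asymptote, the area is close to the difference in difficulty multiplied by one minus the lower asymptote, which gives a quick sanity check on the sum.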

Factor Score

In factor analysis, underlying factors (i.e., dimensions or traits) are
hypothesized and the correlations of each variable with the hypothesized
factors are computed. In an achievement test, each item is treated as a
variable. Such an analysis could be conducted twice using examinees from two
different cultural backgrounds. Ideally, the two separate groups of examinees
would yield similar sets of item-trait correlations (factor loadings).
Different sets of factor loadings would indicate that the two groups are not
responding to the items in the same manner. Such a test would be considered
biased in that it appears to measure a different trait across groups. The
items exhibiting the most bias would then be those with the largest
differences in factor loading.

Merz (1973, 1976a) has suggested an approach which incorporates
factor scores and analysis of variance. In this approach, the item responses
for the groups are combined, factor analyzed, and factor scores for each
examinee on each factor computed. These factor scores are then subjected to
an analysis of variance, with group membership being the independent
variable. Where significant mean differences are found in factor scores, the
factor is classified as biased. Biased items are defined as those with high
factor loadings on a biased factor.
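
The last two steps of this approach (testing each factor for a group difference, then reading item bias off the loadings) might look like the sketch below. It assumes factor scores and loadings have already been computed elsewhere, replaces the analysis of variance with a two-group test statistic whose p-value comes from a large-sample normal approximation, and takes the absolute value of each loading. The data and function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def factor_is_biased(scores_group1, scores_group2, alpha=0.001):
    """Test one factor for a group difference in mean factor score.

    Uses an unequal-variance two-sample statistic with a normal
    approximation to the p-value (reasonable only for large samples;
    this stands in for the analysis of variance named in the text).
    """
    n1, n2 = len(scores_group1), len(scores_group2)
    se = sqrt(variance(scores_group1) / n1 + variance(scores_group2) / n2)
    z = (mean(scores_group1) - mean(scores_group2)) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p_value < alpha


def item_bias_index(loadings, biased_factors):
    """Largest absolute loading of each item on any biased factor."""
    return [
        max((abs(row[j]) for j in biased_factors), default=0.0)
        for row in loadings
    ]


# Hypothetical factor scores: factor 0 separates the groups, factor 1 does not.
scores = {
    0: ([1.0, 1.2, 0.8, 1.1, 0.9] * 20, [-1.0, -1.2, -0.8, -1.1, -0.9] * 20),
    1: ([0.1, -0.1, 0.0, 0.2, -0.2] * 20, [0.1, -0.1, 0.0, 0.2, -0.2] * 20),
}
# Hypothetical loadings: one row per item, one column per factor.
loadings = [[0.70, 0.10], [0.20, 0.65], [-0.55, 0.30]]

biased = [j for j, (g1, g2) in scores.items() if factor_is_biased(g1, g2)]
index = item_bias_index(loadings, biased)
print(biased, index)
```

Only loadings on factors classified as biased contribute to an item's index, so an item that loads heavily on an unbiased factor is not penalized.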


Item Sample

The 1973 Stanford Achievement Test, Form A, Primary 2 Battery,
Reading Comprehension Subtest (SAT), which, item for item, is equivalent
to the Stanford Achievement Test - Hearing Impaired Version, Level 2, Reading
Comprehension Subtest, formed the item pool for use in this study.

The SAT consists of 16 paragraphs with a total of 48 four-choice items.
According to the test publishers, the Psychological Corporation, reading
vocabulary is geared to the primary grade levels and emphasis is placed on
comprehending connected discourse. It was anticipated that the SAT would
contain several items biased in favor of one of the incorporated culture
group samples.

Examinee Samples

Item responses made by large samples from two diverse culture groups
were used in the study. The first culture group was composed of 2,637
students in programs for the hearing impaired across the United States. The
scores on the SAT for this group were approximately normally distributed with
a mean of 21.6 and a standard deviation of 7.42. This culture group was
divided into two subgroups by randomly assigning the examinees to one of two
independent groups with significantly different (p < .01) mean total scores.
Both subgroups were approximately normally distributed. The first subgroup
contained 1,079 examinees with a mean of 23.7 and standard deviation of 7.43.
The second subgroup contained 1,030 examinees with a mean of 20.9 and a
standard deviation of 6.97. Since the examinees were from the same culture
group, the expected degree of aberrance for each item was zero. That is,
the approaches were expected to be insensitive to the differential
performance of the examinee groups, consistently identifying item aberrance
as minimal.

The second culture group, representative of the population for which
the SAT was designed, was composed of 1,607 examinees from a large west coast
public school system. The scores on the SAT for this hearing group were
bimodally distributed with modes at 16 and 44, a mean of 28.9 and a standard
deviation of 12.44.

One major difference between these two culture groups is their exposure
to, and their ability to use, the English language (see Stoke, 1976 for an
excellent discussion on the social and cultural characteristics of the
hearing impaired). Thus, aside from cultural differences, the two groups of
examinees greatly differed in their mean level of ability as measured by
total score on the SAT.


Procedures

The degree of bias for each item within the SAT was identified by applying
a select approach within the transformed item difficulties, icc theory,
factor score and chi-square categories to item responses made by (1) the two
diverse culture group samples, and (2) two equal culture group samples.

Each item bias detection approach was applied to item responses made by
these culture group pairs in the following manner:

transformed item difficulties: Two sets of item p-values were computed
for each culture group pair and transformed to within-group normal deviates.
From the bivariate scatterplot of the sets of transformed p-values, the
absolute values of the magnitudes of the item residuals, i.e. the item-45
degree line distances, were computed. This residual magnitude served to
indicate the relative amounts of item bias.


icc theory: Two sets of item icc parameters as defined by Birnbaum's
three parameter logistic model were estimated for each of the SAT items by
separately applying the Urry (1975) iterative minimum chi-square procedure
to the item responses of each of the two culture groups. The parameter
value estimates were then equated by computing the between group linear
regressions for the difficulty and discrimination parameters. The areas
between estimated equated icc's, as approximated by

$$\sum_{\theta = -5.000}^{5.000} \left| P(u_g = 1 \mid \theta) - P'(u_g = 1 \mid \theta) \right| \, \Delta\theta$$

where $P(u_g = 1 \mid \theta)$ and $P'(u_g = 1 \mid \theta)$ are the
estimated equated icc's and $\Delta\theta = .005$, served to indicate the
extent of item aberrance.


factor score: The item responses on the SAT made by the two culture
groups within each pair were combined and inter-item product-moment
correlations computed. The resultant matrix was then reduced using principal
component factor analysis with an eigenvalue criterion of 1.0. The factor
matrix was rotated orthogonally (varimax) to simple structure and factor
scores for each examinee on each factor computed. Separate t-tests were
computed using each set of factor scores as dependent variables and group
membership as the independent variable. Factors for which there were
significant (p < .001) differences between mean culture group factor scores
were classified as biased. The magnitude of the factor loading ($\lambda$)
on such factors served as indicators of the magnitude of item bias.
$\Lambda_g$ was then defined as the maximum item factor loading on factors
classified as biased. That is,

$$\Lambda_g = \max_j \left[ \lambda_{gj} \right], \quad j = 1, 2, 3, \ldots, \text{number of biased factors.}$$

chi-square: Each item was tested individually for bias using a modified
chi-square technique with i = 2 culture groups and j = 5 total score
intervals. By this approach, the expected values for each cell ($E_{ij}$)
were obtained by multiplying (1) the proportion of all examinees with total
scores within interval j responding correctly to the item by (2) the number
of examinees within the cell. That is,

$$E_{ij} = \frac{O_{\cdot j}}{N_{\cdot j}} \, N_{ij}, \quad i = 1, 2; \; j = 1, 2, 3, 4, 5$$

where $O_{\cdot j}$ is the number of examinees in total score interval j
        responding correctly,

      $N_{\cdot j}$ is the total number of examinees in interval j,

      $N_{ij}$ is the total number of examinees in Group i and
        score interval j.

As with a conventional chi-square, observed cell values were simply the
number of examinees within the cell responding correctly to the item. For
each item, the magnitude of aberrance was indicated (1) by the value of the
resultant $\chi^2$ and (2) by one minus the probability associated with the
$\chi^2$.
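
The expected-value rule above can be illustrated with a short Python sketch. It assumes the index is the usual sum of (observed - expected) squared over expected, taken across the group-by-interval cells of correct responses; the text does not spell out that summation, so treat it as an assumption. The counts are hypothetical and the function name is invented for this example. The probability associated with the statistic is not computed here, since it requires a chi-square distribution function.

```python
def modified_chi_square(correct, total):
    """Chi-square index for one item.

    correct[i][j] = examinees in group i, score interval j answering correctly
    total[i][j]   = all examinees in group i, score interval j
    """
    n_groups, n_intervals = len(correct), len(correct[0])
    chi_square = 0.0
    for j in range(n_intervals):
        correct_j = sum(correct[i][j] for i in range(n_groups))
        total_j = sum(total[i][j] for i in range(n_groups))
        proportion_j = correct_j / total_j  # pooled proportion correct
        for i in range(n_groups):
            expected = proportion_j * total[i][j]
            chi_square += (correct[i][j] - expected) ** 2 / expected
    return chi_square


# Hypothetical counts: 2 culture groups by 5 total score intervals.
total = [[100, 100, 100, 100, 100], [50, 50, 50, 50, 50]]
same_rates = [[20, 40, 60, 80, 90], [10, 20, 30, 40, 45]]
different_rates = [[20, 40, 60, 80, 90], [5, 10, 30, 40, 45]]

chi_unbiased = modified_chi_square(same_rates, total)
chi_biased = modified_chi_square(different_rates, total)
print(round(chi_unbiased, 2), round(chi_biased, 2))
```

When both groups answer correctly at the same rate within every score interval, observed and expected counts coincide and the index is zero; lowering the second group's rate in the two lowest intervals raises it.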

Statistical Analysis

Statistical and graphic analyses were conducted to obtain a global
perspective of the similarities and differences among the methodologies. The
following analyses were employed:

1. The relative amount of similarity between pairs of approaches was
   determined by their respective Pearson Product-Moment correlations.

2. The identified degrees of bias were compared, item by item, by
   examining graphs in which items are represented on the abscissa and degree
   of item bias on the ordinate.
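
As a small illustration of the first analysis, the sketch below computes Pearson product-moment correlations between every pair of aberrance indices. The index values are invented for five items and do not come from the study's tables.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical aberrance indices for the same five items (illustrative only).
indices = {
    "transformed item difficulties": [0.10, 0.45, 0.80, 0.20, 1.10],
    "icc area": [0.15, 0.30, 0.70, 0.25, 0.90],
    "chi-square": [5.0, 30.0, 60.0, 12.0, 95.0],
}

correlations = {
    (name_a, name_b): pearson_r(indices[name_a], indices[name_b])
    for name_a, name_b in combinations(indices, 2)
}
for pair, r in correlations.items():
    print(pair, round(r, 2))
```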


Diverse Culture Group Comparison

The indices of aberrance for each approach to biased item identification
for the diverse culture group comparison are given in Table 1. In the ICC
approach, two items, 21 and 44, could not be parameterized because of near
zero item-test correlations, and hence could not be evaluated. Seven factors
with eigenvalues exceeding unity were extracted by the principal components
analysis and rotated orthogonally. Significant differences (p<.001) between
the mean factor score for the two culture groups were found for six factors.
Table 1 shows the maximum factor loading for each item on one of these six
factors. The values for the Transformed Item Difficulties ranged from .04 to
1.25.



Insert Table 1 about here



Because of the dissimilar total score distributions, a problem was
encountered in applying the chi-square approach. Initially, five observed
score intervals were defined for each item according to the number of
examinees in the hearing sample that responded correctly to the item. This
resulted in highly disproportionate numbers of hearing impaired examinees in
each interval. Also, defining intervals based on the item response
distributions of the hearing impaired examinees resulted in highly
disproportionate numbers of hearing examinees in each interval. A compromise
was achieved by averaging the proportions of examinees responding correctly
to the item at each observed score level across groups, and using four
intervals instead of five.

In addition to using the $\chi^2$ value to indicate the relative amount of
aberrance, one minus the probability associated with the chi-square was used.
Both indices are included in Table 1. The use of the probability value as
an index identified 56 percent of the items in the SAT as substantially
aberrant at (1-p) > (1-.001).






Insert Table 2 about here 



The correlations between the indices of aberrance for each method in
the diverse culture group comparisons are given in Table 2. The chi-square -
ICC (.67) and the chi-square - transformed item difficulties (.59)
correlations were significant at p<.01. All correlations involving the
chi-square and transformed item difficulties approaches were significant,
indicating some degree of similarity between each of these approaches and the
other models. The factor score and chi-square (1-p) approaches showed the
lowest degree of similarity with the other approaches. The average
correlation of each of these with the other approaches was .29 and .^5,
respectively, while the average correlation with other approaches for the
chi-square ($\chi^2$), transformed item difficulties, and ICC approaches were
.48, .37, and .36, respectively.


Equal-culture Group Comparison

The indices of aberrance for the item responses in the equal-culture
group comparisons for each approach are given in Table 3. The transformed
item difficulties correlated highly (r = .98) and all the perpendicular
item-main axis line distances were minimal. The maximum distance was .28. No
items would appear to be identified as biased by this approach.

In the icc approach, again items 21 and 44 did not fit the model and
could not be evaluated. Items 28 and 39 showed the most aberrance with values
of .51 and .74, respectively. Both of these items showed less aberrance in
the diverse culture group comparisons, indicating possible misclassification
by this approach.



Insert Table 3 about here 






Fourteen factors with eigenvalues exceeding unity were extracted by the
principal components analysis and rotated orthogonally. Significant
differences (p<.001) between the mean factor scores for the two equal-culture
groups were found for three factors. The maximum factor loading for items on
these three factors ranged between .06 and .72. This range is about the same
as the range noted in the diverse culture group comparisons.

Using the chi-square approach, five total score intervals were defined
based on the average proportions of examinees responding correctly. The
chi-square values obtained were considerably smaller than the values obtained
in the diverse culture group comparisons, and no items would have been
classified as aberrant at the .05 level.



Insert Figure 3 about here 



Figure 3 gives a plot of the aberrance indices for each item for each
approach in the diverse culture group comparison and the equal-culture group
comparison. It is apparent from Figure 3 that for each approach the variance
of aberrance in the equal-culture group comparison is less than in the
diverse culture group comparison. In the equal-culture group comparisons,
both the factor score approach and the chi-square (1-p) approach appear to
have an undesirable amount of variation.

DISCUSSION 



The diverse culture group comparison illustrated the approaches as they
might be applied in actual test development. Large numbers of examinees from
two different populations responded to a pool of items purported to measure
the same ability - reading comprehension. Each approach identified a degree
of item aberrance for each item. The results show that there was some
agreement in terms of the identified degrees of aberrance between (1) the
transformed item difficulties and chi-square (magnitude) approaches and (2)
the icc theory and chi-square (magnitude) approaches, although the agreement
was not overwhelming (r = .59 and r = .67, respectively). One minus the
probabilities associated with the $\chi^2$'s and the factor score approach
showed little agreement with any of the other methodologies.

Whether the identified degrees of aberrance are in agreement has little
direct meaning in test development. A more pertinent question is: Do the
approaches lead to the same decisions with regard to which items to classify
as "very biased"? If the answer were in the affirmative, the most appealing
approach would be the simplest one. Table 4 illustrates which items would
be classified as "very biased" by the icc theory, transformed item
difficulties and chi-square (magnitude) approaches under the following
decision rules:

(a) icc theory - area > .50
(b) transformed item difficulties - distance > .60
(c) chi-square (magnitude) - $\chi^2$ > 55.0

These decision rules were determined by identifying, from Figure 3,
cut-points which appear to define outliers. Since the variances of the
identified degrees of aberrance for the factor score and chi-square
(probabilistic) approaches were small, any reasonable cut-point would have
resulted in large numbers of items being classified as "very biased"; thus
these approaches are not included in the table.
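
How such decision rules sort items into those flagged by all three, two, or only one approach can be sketched as follows. The cut-points are the ones listed above; the item numbers and index values are invented for illustration and are not taken from Table 4.

```python
from collections import Counter

# Cut-points from decision rules (a), (b) and (c).
CUTOFFS = {
    "icc area": 0.50,
    "transformed item difficulties": 0.60,
    "chi-square": 55.0,
}

# Hypothetical index values keyed by item number (illustrative only).
indices = {
    "icc area": {1: 0.55, 2: 0.20, 3: 0.81, 4: 0.66},
    "transformed item difficulties": {1: 0.30, 2: 0.64, 3: 0.95, 4: 0.72},
    "chi-square": {1: 60.0, 2: 12.0, 3: 120.0, 4: 75.0},
}


def very_biased(index_by_item, cutoff):
    """Items whose aberrance index exceeds the cut-point."""
    return {item for item, value in index_by_item.items() if value > cutoff}


flagged = {
    name: very_biased(values, CUTOFFS[name]) for name, values in indices.items()
}
votes = Counter(item for items in flagged.values() for item in items)

by_all_three = sorted(item for item, n in votes.items() if n == 3)
by_exactly_two = sorted(item for item, n in votes.items() if n == 2)
by_only_one = sorted(item for item, n in votes.items() if n == 1)
print(by_all_three, by_exactly_two, by_only_one)
```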



Insert Table 4 about here



From Table 4, it is apparent that the approaches, under these
decision rules, would have commonly identified items 16, 17, and 22 as "very
biased." Two approaches would have identified items 4, 15, 18, 26, 27, 30,
and 45 as being biased. Items 8, 23, 24, 25, 29, 44 and 47, however, were
identified by only one approach. More conservative, or more liberal, decision
rules would still have resulted in different sets of items being identified.

Since there is some disagreement among the approaches, the results of
the equal-culture group comparison warrant closer examination. The two
groups of examinees in this comparison were from the same well-defined
population; namely, students with a hearing loss sufficient to warrant a
special educational program. As such, item bias between these two groups
is by definition minimal, and the expected amount of aberrance identified for
each item by each approach is assumed to be zero.

Of the approaches, only the transformed item difficulties approach fully
met this criterion. The identified degrees of aberrance from this approach
were small, and by any reasonable decision rule, no items would have been
classified as biased. Thus, the model behaved as expected. The identified
degrees of item aberrance as indicated by the icc theory approach were also
minimal. However, two items could not be evaluated and two items would have
been identified as having fair amounts of aberrance under a liberal decision
rule.

The icc theory approach unexpectedly identified items 28 and 39 as
containing fair amounts of bias. A closer examination of these items reveals that
their latent trait item difficulty parameters were extreme for the second
group of examinees, namely 2.77 and 3.91 respectively. This can be loosely
interpreted as meaning that, ignoring guessing, an examinee's ability must be
2.77 (3.91) standard deviations above the group mean ability to have a better
than average chance of responding correctly. Since relatively few examinees
were of this ability level, parameterization became tenuous and the slight
aberrance in these items is probably due to abnormally high parameterization
error. Thus, this approach is liable to yield spurious results when item
difficulty is extremely high or low. It should be noted that the number of
items in the SAT is really insufficient for a proper evaluation of the icc
approach. From a Monte Carlo investigation of the Urry parameterization
procedure, Schmidt and Gugel (1975) have recommended that a minimum of 60 items
and 1,000 subjects be used to obtain accurate parameter estimates. Since the
SAT contains only 45 items, the parameter value estimates may have contained
more than the usual amounts of error.
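The loose interpretation of the difficulty parameter can be checked numerically. The sketch below is illustrative only and makes three assumptions: a two-parameter logistic curve with guessing ignored, the conventional scaling constant D = 1.7 with a discrimination of 1.0, and abilities that follow a standard normal distribution within the group. None of these values are taken from the study's parameter estimates.

```python
import math

D = 1.7  # conventional logistic scaling constant (an assumption of this sketch)


def p_correct(theta, a, b):
    """Two-parameter logistic item characteristic curve, guessing ignored."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def share_above(z):
    """Proportion of a standard normal ability distribution lying above z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    for b in (2.77, 3.91):
        # At theta == b the probability of a correct response is exactly .5.
        print(b, p_correct(b, a=1.0, b=b), share_above(b))
```

Under these assumptions fewer than three examinees in a thousand would lie above an ability of 2.77, and far fewer above 3.91, so the upper portion of such a curve rests on very little data.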

\ y 

Items 21 and 44 had extremely low item-test point biserial correlations,
which implied that ability was poorly related to the probability of a correct
response. Such items cannot fit the Birnbaum model and hence cannot be
evaluated for bias with the icc theory approach. Although such items are usually
the first to be eliminated in test development, the fact that these items
cannot be evaluated illustrates a weakness in the approach.

The chi-square approach in the equal-culture group comparison produced
wide fluctuations in the probabilities associated with the χ²'s used to test
the null hypothesis of no bias. However, at p<.05 [(1-p)>.95], no items were
suspected as being biased. Thus, although 56 percent of the items
were identified as biased in the diverse-culture group comparison, in terms
of the equal-culture group comparison, the chi-square approach appeared to be
sufficient when either probabilities or magnitudes were employed.

The factor score approach identifies aberrant items as those having a
major loading on a factor which yields unequal mean factor scores. In the
equal-culture group comparison, three sets of mean factor scores were identified
as unequal at conservative values (p<.001). The maximum loadings of many items
on these factors were high, several being higher than the maximum loading in
the diverse-culture group comparison. The approach, as applied to the data
in this study, produced unsatisfactory results in the equal-culture group
comparison.

The above discussion has pointed out that there were differences between
the approaches in the identified degrees of aberrance in both the diverse-
culture group and equal-culture group comparisons. Of the methodologies,
the transformed item difficulties and icc theory approaches appear most
attractive. In the diverse-culture group comparison several items were
identified as biased, and in the equal-culture group comparison, the identified
degrees of aberrance were minimal. The factor score approach did not identify
much variance in item bias in the diverse-culture group comparison and yielded
major loadings in the equal-culture group comparison. Using a conservative
probability level (p<.001) the chi-square approach identified 56 percent of
the items as biased in the diverse-culture group comparison and yielded wide
fluctuations in the amount of aberrance in the equal-culture group comparisons.

These latter two approaches - the chi-square approach and the factor
score approach - both incorporate significance testing of large amounts of
data. The chi-square approach examines the hypothesis that the proportions
of examinees responding correctly are identical across individuals in the
same observed score interval and of different cultural classifications. The
factor score approach incorporates the hypothesis that the group mean factor
scores are identical across the defined culture groups on each factor. With
samples as large as that used in this study, hypothesis testing may not be
appropriate. The sample values are such that they can be considered
population values and small differences are statistically significant.
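The point about large samples can be illustrated with a generic two-proportion z-test. This is not one of the tests used in the study; it is a minimal sketch in which the proportions (.50 and .54) are hypothetical and the group sizes 2,637 and 1,607 are the ones reported in the abstract.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p2 - p1) / se


def two_sided_p(z):
    """Two-sided normal tail probability for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Same hypothetical four-point difference, two very different sample sizes.
    for n1, n2 in ((2637, 1607), (100, 100)):
        z = two_proportion_z(0.50, n1, 0.54, n2)
        print(n1, n2, round(z, 2), round(two_sided_p(z), 3))
```

With the large groups the four-point difference is significant at the .05 level, while the same difference with 100 examinees per group is not, which is the sense in which significance testing loses its usefulness as sample values approach population values.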

In the diverse-culture group comparison, the χ² values correlated with
the distances of the transformed item difficulties approach and the areas of
the icc theory approach. However, their magnitudes were extreme. It should
be noted that in the diverse-culture group comparison, the total score
distributions of the examinee samples were quite divergent. In the equal-culture
group comparison, the distributions were not as different and the χ² values
were substantially less.

The chi-square approach analyzes the item response data in terms of
observed score intervals. The observed value for an interval and culture
group is simply the number of examinees in the interval and culture group
responding correctly to the item. The expected value for a culture group and
interval is the product of the proportion of all examinees in the interval
responding correctly to the item and the number of examinees in the culture
group and in the interval. Thus, the expected value will be influenced by the
culture group with the greater number of examinees in the interval when the
observed score distributions are different. Since the item interval definitions
are often similar, this will result in a near systematic inflation of the χ²
values.



Insert Table 5 about here



An example of how total score distributions affect the expected interval
values (and consequently the χ² values) is illustrated by the hypothetical
item response data shown in Table 5. Here, the total observed score
distributions are quite different. Group 1 has more than five times as many examinees
in the interval as does Group 2. Further, the total number of examinees at
each total score level within the interval decreases as total score increases
for Group 1 and increases for Group 2. However, the proportions of examinees
responding correctly to the item at each total score level are identical across
groups. That is, the two groups perform identically within the interval and
their total score distributions are dissimilar. If the approach were not
sensitive to total score distributions, the observed and expected values for
each group would be identical. However, the observed and expected values are:

for group 1, $O_1 = 136$ and $E_1 = \frac{136 + 31}{480 + 90} \cdot 480 = 140.6$, and

for group 2, $O_2 = 31$ and $E_2 = \frac{136 + 31}{480 + 90} \cdot 90 = 26.4$.

Even though the two groups performed identically at each total score
level, the observed and expected values are unequal and would have inflated
the χ² value. Had different distributions been employed, different expected
values and a different χ² would have been defined.
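The arithmetic of this example can be reproduced directly from the Table 5 counts. The sketch below is illustrative; the function names are hypothetical, and the form of the interval's contribution to χ² (the sum of (O - E)²/E over the two groups) is an assumption of the sketch, since the statistic's formula is not restated in this section.

```python
# (number of examinees, number responding correctly) at each total score
# level inside the interval, as given in Table 5.
GROUP_1 = [(200, 40), (160, 48), (120, 48)]
GROUP_2 = [(10, 2), (30, 9), (50, 20)]


def totals(group):
    """Return (examinees in the interval, observed number correct)."""
    n = sum(size for size, _ in group)
    observed = sum(correct for _, correct in group)
    return n, observed


def expected_values(group_1, group_2):
    """Expected correct responses per group from the pooled proportion."""
    n1, o1 = totals(group_1)
    n2, o2 = totals(group_2)
    pooled = (o1 + o2) / (n1 + n2)  # proportion correct, all examinees
    return pooled * n1, pooled * n2


def interval_chi_square(group_1, group_2):
    """Assumed contribution of this interval: sum of (O - E)^2 / E."""
    (_, o1), (_, o2) = totals(group_1), totals(group_2)
    e1, e2 = expected_values(group_1, group_2)
    return (o1 - e1) ** 2 / e1 + (o2 - e2) ** 2 / e2


if __name__ == "__main__":
    e1, e2 = expected_values(GROUP_1, GROUP_2)
    print(round(e1, 1), round(e2, 1))
    print(round(interval_chi_square(GROUP_1, GROUP_2), 3))
```

The expected values come out at 140.6 and 26.4 even though the two groups answer correctly at the same rate at every score level, so the nonzero contribution arises entirely from the difference in total score distributions.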

The inflation of the χ² values will be systematic when identical
intervals are used for each item. This systematic inflation allows the χ²'s to be
used as a relative index of bias. Even though the inflation was not perfectly
systematic, the magnitudes of the χ²'s in the diverse-culture group comparison
correlated well with the areas of the icc theory approach. Had the
distributions of the examinee groups been identical, there would have been no
distortion of the χ²'s and significance testing would have been meaningful. Under
such instances, one would expect an even higher correlation.

The factor score approach entails many decision points which will affect
the results. In this study, phi-correlations of the combined data, principal
component analysis, eigenvalues greater than 1.0, varimax rotation, and
probabilities less than .001 were used, and the results appeared to be
unsatisfactory. In the diverse-culture group comparison 26 out of 48 items had a
maximum factor loading of .55 ± .10 on a factor yielding significantly
different mean factor scores, and the identified degrees of aberrance in the
equal-culture group comparison fluctuated widely, with several items being
identified as being more aberrant than the most aberrant item in the diverse-
culture group comparison.

The factor score approach attempts to identify items which most strongly
measure traits in which the groups differ significantly. In large scale
investigations, groups are likely to differ on any measured trait including
the ones intended by the test publisher and those unintentionally built into
the test. Thus a significant difference in the mean factor scores on the
main test factor may be of little interest. Differences on other factors,
however, would indicate the presence of items which inappropriately influence
group mean scores. In order to identify these items, the underlying factors
of the test must be well-defined and the major factor clearly identified.
Principal component analysis using eigenvalues greater than one and varimax
rotation does not appear to allow for this. Principal component analysis
yields factors which are defined by the data (as opposed to inferred), a
unity eigenvalue criterion does not guarantee that the correct number of
factors will be extracted, and varimax rotation can obfuscate the major factor.
A different set of factor analytic procedures might have yielded more
equitable results.

It should be noted that the factor score approach incorporates a
definition of item bias which is substantially different from that of the other
approaches. The approach seeks to identify items which measure a trait other
than that measured by the remaining items of the test (by factor analyzing the
combined data) and heavily contribute to differential performance (by
contributing to differential mean factor scores). Generically, the other
approaches are concerned with which items measure different traits across groups
and operationally with which items behave differently across groups. This
distinction is not as subtle as it may appear. The other approaches are
incapable of identifying items which measure a trait other than that gauged by
the other items when the groups perform equitably.

The transformed item difficulties and the icc theory approaches also
incorporate different operational definitions of bias. The transformed item
difficulties approach identifies items which, relative to the other items in
the test, are more difficult for members of one group than they are for members
of another group of examinees. The icc theory approach identifies items for
which examinees of the same true ability and from different population groups
have unequal probabilities of a correct response. Thus, the transformed item
difficulties approach addresses aggregate group performance as indicated by
item p-values and the icc theory approach addresses the range of item
performance along the ability continuum as indicated by item characteristic curves.

The difference between these two approaches is illustrated by items 25
and 17 (in Figure 4). In the diverse-culture group comparison, item 25 was
identified as biased by the icc theory approach and not by the transformed
item difficulties approach. The overall difficulty of the item for the two
diverse-culture groups is about equal. Consequently, the item was not identified
by the transformed item difficulties approach. However, low ability hearing
impaired examinees and high ability hearing examinees are favored. That is,
when considered across ability levels the item behaved differently between
groups. Item 17, which was identified by both approaches, does not show this
type of inverted differential performance. Across the ability continuum,
hearing examinees are favored.
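The contrast between inverted and noninverted differential performance can be sketched with two pairs of hypothetical curves. This is an illustration only: it assumes a two-parameter logistic curve rather than the Birnbaum model with guessing, the scaling constant D = 1.7, made-up item parameters, and a trapezoid-rule area over abilities from -3 to 3. It does not reproduce the study's equating or area computation.

```python
import math

D = 1.7  # conventional logistic scaling constant (an assumption of this sketch)


def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def areas(curve_1, curve_2, lo=-3.0, hi=3.0, steps=600):
    """Trapezoid-rule signed and unsigned area between two ICCs."""
    width = (hi - lo) / steps
    signed = unsigned = 0.0
    for k in range(steps + 1):
        theta = lo + k * width
        diff = icc(theta, *curve_1) - icc(theta, *curve_2)
        weight = 0.5 if k in (0, steps) else 1.0  # half weight at the ends
        signed += weight * diff * width
        unsigned += weight * abs(diff) * width
    return signed, unsigned


# Hypothetical (a, b) pairs for the two groups on one item.
CROSSING = ((0.5, 0.0), (1.5, 0.0))  # curves cross: inverted performance
SHIFTED = ((1.0, 0.0), (1.0, 0.5))   # one group favored at every ability

if __name__ == "__main__":
    print("crossing:", areas(*CROSSING))
    print("shifted: ", areas(*SHIFTED))
```

For the crossing pair the signed area is essentially zero while the unsigned area is large, so an index based on overall difficulty would see nothing and an area index would flag the item, as with item 25. For the shifted pair both areas agree, the pattern described for item 17.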



Insert Figure 4 about here



When comparing the transformed item difficulties and icc theory approaches
in terms of different decision rules, five items were commonly identified by
both approaches. All five of these items were of this latter type - noninverted
differential performance across the ability continuum. This further illustrates
that the transformed item difficulties approach is sensitive to differences in
mean item difficulty while the icc theory approach appears to be sensitive to
both mean item difficulty and to group performance along the continuum. However,
it should be noted that different definitions of item difficulty, and hence
mean group performance, are employed. The transformed item difficulty
approach directly defines item difficulty from the aggregate data. The icc
theory approach infers item difficulty from performance on the item alone.
Since these different definitions are employed, different items were identified
as being biased against a group as a whole.

Conclusions

Based on the two applications, the factor score and chi-square (1-p)
approaches appeared to be inadequate for identifying biased items. The χ²
values in the chi-square approach were shown to become inflated as total
observed score distributions differ, thus making significance testing
inappropriate and leading to erroneous classifications of bias. The factor score
approach, which incorporates a somewhat different definition of bias,
identified large degrees of aberrance in the equal-culture group comparison. It
was felt that the decisions used in factor analyzing the data led to the
unsatisfactory results. It was further noted that both of these approaches
employed inference testing which may not be appropriate with the large sample
sizes used in this study.

The transformed item difficulties, the icc theory and chi-square (χ²)
approaches appeared to be most promising. The identified degrees of
aberrance in the equal-culture group comparison were consistently low for these
approaches, although a liberal decision rule would have led to the false
identification of one or two items by the icc theory approach. The first two
approaches identified several items in common in the diverse-culture group
comparison. The major difference between these two methodologies is that the
icc theory approach appears to be sensitive to bias against both individuals
and groups of examinees and the transformed item difficulties approach appears
to be sensitive to bias only against groups. When uniform intervals are defined,
the chi-square (χ²) approach appears to approximate the icc theory approach
and the derived χ² values can be used as indices of relative bias.

Recommendations

The investigation utilized a single set of diverse-culture group data
for which the item parameters were unknown a priori. While there was
substantial reason to suspect the presence of some biased items, the true number
of biased items, their amounts of aberrance and their item numbers were
unknown. A similar study using simulated data with known parameters may prove
revealing. Such a study could also investigate the behavior of the approaches
under different numbers of biased items.

One of the more promising and interesting approaches to the detection of
biased items, the distractor response analysis (Veale and Foreman, 1975, 1976;
Maw, 1977), was not evaluated in this study due to the lack of the
appropriate item response data. Rather than analyzing the numbers of examinees
responding correctly, this approach identifies differences in distractor
response patterns. Although the approach incorporates inference testing, it
may prove beneficial to the field and should be considered in future
investigations of item bias detection methodologies.






REFERENCES 



Angoff, W. H. A technique for the investigation of cultural differences.
Paper presented at the annual meeting of the American Psychological
Association, Honolulu, May 1972.

Angoff, W. H., & Ford, S. F. Item-race interaction on a test of scholastic
aptitude. Journal of Educational Measurement, 1973, 10, 95-105.

Angoff, W. H., & Modu, C. C. Equating the scales of the Prueba de Aptitud
Academica and the Scholastic Aptitude Test. New York: College Entrance
Examination Board, 1973.

Birnbaum, A. Some latent trait models and their use in inferring an examinee's
ability. In F. M. Lord & M. R. Novick, Statistical Theories of Mental
Test Scores. Reading, MA: Addison-Wesley, 1968, Chapts. 17-20.

Cardall, C., & Coffman, W. R. A method for comparing performance of different
groups on the items in a test. (RM 64-61) Princeton: Educational
Testing Service, 1964.

Cleary, T. A., & Hilton, T. L. An investigation into item bias. Educational
and Psychological Measurement, 1968, 61-75.

Cole, N. S. Bias in selection. Journal of Educational Measurement, 1973, 10,
237-255.

Darlington, R. B. Another look at "cultural fairness." Journal of Educational
Measurement, 1971, 8, 71-82.

Echternacht, G. A quick method for determining test bias. Educational and
Psychological Measurement, 1974, 34, 271-280.

Eells, K., Davis, A., Havighurst, R. J., Herrick, V. E., & Tyler, R. W.
Intelligence and Cultural Differences. Chicago: University of Chicago Press,
1951.

Einhorn, H. J., & Bass, A. R. Methodological considerations relevant to
discrimination in employment testing. Psychological Bulletin, 1971, 75(4),
261-269.

Green, D. R., & Draper, J. F. Exploratory studies of bias in achievement
tests. Monterey: CTB/McGraw-Hill, 1972.

Gross, A. L., & Su, W. Defining a fair or unbiased selection model: a question
of utilities. Journal of Applied Psychology, 1975, 60, 345-351.

Guilford, J. P. Psychometric methods. New York: McGraw-Hill, 1954.

Hambleton, R. K., & Cook, L. L. Latent trait models and their use in the
analysis of educational data. Journal of Educational Measurement, 1977,
14(2), 75-96.



Hicks, M. M., Donlon, T. F., & Wallmark, M. M. Sex differences in item responses
on the Graduate Record Examination. Paper presented at the annual meeting
of the National Council on Measurement in Education, San Francisco,
April 1976.

Jensema, C. J. An application of latent trait mental test theory to the
Washington pre-college testing battery. Unpublished doctoral dissertation,
University of Washington, 1972.

Jensen, A. R. An examination of culture bias in the Wonderlic Personnel Test.
Arlington, VA: ERIC Clearinghouse, 1973. (ERIC Document Reproduction
Service ED 086 726).

Lord, F. M. A theory of test scores. Psychometric Monograph Number 7.
Princeton: Educational Testing Service.

Lord, F. M., & Novick, M. R. Statistical Theories of Mental Test Scores (2nd
Ed.). Reading, MA: Addison-Wesley, 1968.

Lord, F. M. A study of item bias using item characteristic curve theory.
Proceedings of the Third Congress of Cross-Cultural Psychology, Tilburg,
Holland, 1977.

Maw, C. E. Item response patterns and group differences: an application of
the log-linear model. Unpublished doctoral dissertation, University
of Chicago, 1977.

Merz, W. R. Factor analysis as a technique in analyzing test bias. Paper
presented at the annual meeting of the California Educational Research
Association, Los Angeles, 1973.

Merz, W. R. Estimating bias in test items utilizing principal component
analysis and the general linear solution. Paper presented at the annual
meeting of the American Educational Research Association, San Francisco,
April 1976.

Merz, W. R. Test fairness and test bias: a review of procedures. In M. Wargo
and D. R. Green, Achievement Testing of Disadvantaged and Minority Students
for Educational Program Evaluation. New York: McGraw-Hill, 1977, in press.

Pine, S. M. "Application of Item Characteristic Curve Theory to the Problem
of Test Bias," in Weiss, D. J. (Ed.). Applications of Computerized
Adaptive Testing. Minneapolis: University of Minnesota, Department
of Psychology, Psychometric Methods Program, October 1976.

Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests.
Copenhagen: Danmarks Paedagogiske Institut, 1960.

Rudner, L. M. An approach to biased item identification using latent trait
measurement theory. Paper presented at the annual meeting of the
American Educational Research Association, New York, April 1977a.

Rudner, L. M. Efforts toward the development of unbiased selection and
assessment instruments. Paper presented at the Third International
Symposium on Educational Testing, University of Leyden, The Netherlands,
June 1977b.



Rudner, L. M. Item Bias with Deaf and Hearing Examinees. Volta Review,
1977c, in press.

Schmidt, F. L., & Gugel, J. F. The Urry Item Parameter Estimation Technique:
How Effective? Paper presented at the American Psychological Association
Convention, Chicago, August 1975.

Scheuneman, J. A new method of assessing bias in test items. Paper presented
at the annual meeting of the American Educational Research Association,
Washington, April 1975.

Scheuneman, J. A procedure for evaluating item bias in the absence of an
outside criterion. Paper presented at the annual meeting of the American
Educational Research Association, San Francisco, April 1976.

Stokoe, W. C. The study and use of sign language. Sign Language Studies, 1976,
10, 1-36.

Strassberg-Rosenberg, B., & Donlon, T. F. Context influences on sex
differences in performance and aptitude tests. Paper presented at the annual
meeting of the National Council on Measurement in Education, Washington,
D.C., 1975.

Thorndike, R. L. Concepts of cultural fairness. Journal of Educational
Measurement, 1971, 8, 63-70.

Urry, V. W. A Monte Carlo investigation of logistic mental test models.
Unpublished doctoral dissertation, Purdue University, 1970.

Urry, V. W. Ancillary estimators for the parameters of mental test models.
Paper presented at the American Psychological Association Convention,
Chicago, August 1975.

Veale, J. R., & Foreman, D. I. Cultural validity of items and tests: A new
approach. Score Technical Report, Iowa City, Iowa: Westinghouse
Learning Corporation/Measurement Research Center, 1975.

Veale, J. R., & Foreman, D. I. Cultural variation in criterion-referenced
tests: a "global" item analysis. Paper presented at the annual
meeting of the American Educational Research Association, San Francisco,
April 1976.

Williams, R. L., Mosby, D., & Hinson, V. Critical issues in achievement
testing of children from diverse ethnic backgrounds. In M. Wargo and
D. R. Green, Achievement Testing of Disadvantaged and Minority Students
for Educational Program Evaluation. New York: McGraw-Hill, 1977, in
press.

Wingersky, M. S., & Lord, F. M. A computer program for estimating examinee
ability and item characteristic curve parameters when there are omitted
responses. (RM 73-2) Princeton: Educational Testing Service, 1973.



TABLE 1

Degrees of Aberrance Identified by the Approaches
in the Diverse-Culture Group Comparison

Item    ICC     Transformed     Chi-Square   Chi-Square   Factor
 #      Area    Item              (1-p)         (χ²)      Score
                Difficulties

  1     .?0        .24            .98           5.9        .35
  2     .07        .?1            .999         33.1        .53
  3     .29        .13            .87           8.5        .55
  4     .75        .79            .999         54.2        .45
  5     .25        .21            .99          11.1        .63
  6     .17        .18            .89           --         .40
  7     .15        .43            .99          11.9        .45
  8     .50        .54            .999         27.9        .46
  9     .27        .14            .99          12.6        .35
 10     .24        .46            .99          11.1        .42
 11     .34        .54            .999         42.8        .62
 12     .37        .52            .999         43.6        --
 13     .11        .52            .999         55.1        .52
 14     .16        .05            .60           3.0        .28
 15     .25        .68            .999        105.4        .42
 16     .5?       1.11            .999        107.?        .61
 17     .76       1.25            .999        159.0        .65
 18     .83        .85            .999         27.7        .26
 19     .37        .23            .999         30.7        .30
 20     .16        .1?            .99          14.8        .36
 21     --         .44            .99          14.4        --
 22    2.30        .67            .999        240.9        .52
 23     .38        .67            .999         31.8        .21
 24     .61        .51            .98          10.2        .53
 25    1.01        .08            .999         49.5        .57
 26     .38        .67            .999         94.8        .60
 27     .04        .76            .999         65.2        .48
 28     .12        .18            .96           8.2        .55
 29     .29        .44            .999         65.4        .34
 30     .23       1.05            .999        122.3        .52
 31     .13        .07            .999         26.0        .36
 32     .19        .01            .65           4.2        .27
 33     .14        .15            .99          13.7        .44
 34     .15        .05            .96           8.2        .17
 35     .14        .66            .999         17.7        .33
 36     .09        .17            .999         33.6        .22
 37     .07        .32            .18            .9        .26
 38     .14        .43            .999         34.7        .20
 39     .23        .14            .99          15.1        .36
 40     .08        .37            .999         23.4        .44
 41     .27        .16            .60           2.8        .51
 42     .27        .33            .60           2.9        .46
 43     .07        .16            .999         22.8        .46
 44     --         .26            .999        133.2        .48
 45     .55        .04            .999         85.1        .49
 46     .25        .16            .99          13.4        .51
 47     .60        .21            .88           6.1        .57
 48     .34        .24            .999         33.1        .44



Table 2

Correlations of the Degrees of Aberrance Identified
by the Approaches in the Diverse-Culture Group Comparison

                                 Transformed     Chi-Square   Chi-Square   Factor
                                 item              (χ²)         (1-p)      score
                                 difficulties

Icc theory                          .31*           .67          .17         .28

Transformed item difficulties                      .59          .29         .30

Chi-square (χ²)                                                 .31         .34

Chi-square (1-p)                                                            .23

 * p < .05
** p < .01



TABLE 3

Degrees of Aberrance Identified by the Approaches
in the Equal-Culture Group Comparison

Item    ICC     Transformed     Chi-Square   Chi-Square   Factor
 #      Area    Item              (1-p)         (χ²)      Score
                Difficulties

  1     .12        .02            .32           2.4        .19
  2     .15        .02            .22           1.7        .07
  3     .10        .16            .05            .5        .16
  4     .06        .06            .32           2.4        .36
  5     .08        .??            .01            .1        .07
  6     .28        .14            .48           3.3        .06
  7     .24        --             .08            .9        .26
  8     .19        .03            .01            .2        .02
  9     .19        .08            .52           3.4        .32
 10     .08        .02            .28           2.1        .09
 11     .18        .00            .03            .5        .19
 12     .17        .11            .28           2.1        .14
 13     .04        .13            .01            .2        .19
 14     .21        .12            .12           1.2        .20
 15     .04        .07            .18           1.6        .26
 16     .22        .03            .40           2.6        .13
 17     .31        .15            .48           3.3        .20
 18     .26        .07            .08            .9        .57
 19     .32        .03            .68           4.8        .20
 20     .24        .04            .15           1.4        .46
 21     --         .28            .68           4.7        .11
 22     .17        .05            .40           2.6        .06
 23     .34        .14            .06           --         .15
 24     .19        .21            .09           1.6        .20
 25     .36        .09            .68           4.8        .08
 26     .21        .01            .03            .6        .17
 27     .11        .02            .07            .8        .40
 28     .51        --             .59           3.8        .14
 29     .11        --             .26           2.0        .40
 30     .14        .14            --            3.7        .19
 31     .09        .10            --            --         .14
 32     .07        .03            .31           2.3        .24
 33     .34        .12            .78           5.6        .25
 34     .14        .13            .20           1.7        .72
 35     .12        .21            .73           5.3        .70
 36     .22        .18            .07            .8        .72
 37     .06        .15            .26           2.1        .63
 38     .23        .09            .48           3.3        .34
 39     .74        .16            .88           7.6        .10
 40     .38        .06            .47           3.2        .20
 41     .35        .14            .81           6.5        .08
 42     .37        .05            .52           3.5        .11
 43     .29        .08            .12           1.2        .11
 44     --         .12            .31           2.3        .08
 45     .34        .06            .48           3.4        .48
 46     .14        .?0            .07            .8        .51
 47     .32        .07            .08            .9        .68
 48     .26        .16            .68           4.8        .43



TABLE 4

Items classified as biased (***) by
three approaches under select decision
rules in the diverse-culture group comparison

ITEM    ICC       TRANSFORMED     CHI-
 #      THEORY    ITEM            SQUARE
                  DIFFICULTIES    (χ²)

  1      -           -             -
  2      -           -             -
  3      -           -             -
  4     ***         ***            -
  5      -           -             -
  6      -           -             -
  7      -           -             -
  8     ***          -             -
  9      -           -             -
 10      -           -             -
 11      -           -             -
 12      -           -             -
 13      -           -             -
 14      -           -             -
 15      -          ***           ***
 16     ***         ***           ***
 17     ***         ***           ***
 18     ***         ***            -
 19      -           -             -
 20      -           -             -
 21      -           -             -
 22     ***         ***           ***
 23      -          ***            -
 24     ***          -             -
 25     ***          -             -
 26      -          ***           ***
 27      -          ***           ***
 28      -           -             -
 29      -           -            ***
 30      -          ***           ***
 31      -           -             -
 32      -           -             -
 33      -           -             -
 34      -           -             -
 35      -           -             -
 36      -           -             -
 37      -           -             -
 38      -           -             -
 39      -           -             -
 40      -           -             -
 41      -           -             -
 42      -           -             -
 43      -           -             -
 44      -           -            ***
 45     ***          -            ***
 46      -           -             -
 47     ***          -             -
 48      -           -             -



Table 5

Hypothetical Item Response Distributions by Total
Score Levels Within a Single Interval

                   N in each                  N responding
                   total score level          correctly
Total score
level              Group 1     Group 2        Group 1       Group 2

 10                  200          10          40 (20%)       2 (20%)
 11                  160          30          48 (30%)       9 (30%)
 12                  120          50          48 (40%)      20 (40%)

Total                480          90          O1 = 136      O2 = 31



[Figure 3 panels: Diverse-Culture Groups and Equal-Culture Groups; one plot
in each column for Transformed Item Difficulties, ICC Theory, Chi-Square (χ²),
Chi-Square (1-p) and Factor Score; horizontal axis: Item Number]

Figure 3: Plots of the degrees of aberrance identified by each approach
for each group comparison.

[Figure 4 panels: Item 25 and Item 17; each panel shows one curve for Hearing
Examinees and one for Hearing Impaired Examinees; horizontal axis: theta]

Figure 4: Estimated equated icc's for items 17 and 25
in the diverse-culture group comparison.



